Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics to RepartitionExec #398

Merged
merged 3 commits into from
May 23, 2021
Merged

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented May 23, 2021

Which issue does this PR close?

Adds metrics to RepartitionExec. Example output (with local hack to display metrics in query plan):

RepartitionExec: partitioning=Hash([Column { name: "o_custkey" }], 24) metrics=[repartitionTime=19390949,fetchTime=1354115582,sendTime=1642636]
  CoalesceBatchesExec: target_batch_size=4096 metrics=[]
    FilterExec: o_orderdate >= CAST(1994-01-01 AS Date32) AND o_orderdate < CAST(1995-01-01 AS Date32) metrics=[]
      RepartitionExec: partitioning=RoundRobinBatch(24) metrics=[fetchTime=56873223,sendTime=125282,repartitionTime=0]
        ParquetExec: batch_size=8192, limit=None, partitions=[/mnt/tpch/parquet-sf1//orders/part-0.parquet] metrics=[]

Closes #397 .

Rationale for this change

Help debug performance issues in queries.

What changes are included in this PR?

Adds metrics to RepartitionExec.

Are there any user-facing changes?

No. The metrics are not shown by default.

@andygrove andygrove requested a review from Dandandan May 23, 2021 15:53
Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think the time calculation for round robin repartition is missing.

@andygrove
Copy link
Member Author

Looks good. I think the time calculation for round robin repartition is missing.

The new metrics don't include the time for sending the resulting batches to the channels, so the only thing to measure for round-robin would be the time to execute let output_partition = counter % num_output_partitions so I figured that was not worth measuring.

I am now wondering if we should also measure time to send the results to the channel because if this is high it could indicate that upstream operators are not fetching data as fast as they could be. I will take a look at that next.

@codecov-commenter
Copy link

codecov-commenter commented May 23, 2021

Codecov Report

Merging #398 (affa192) into master (174226c) will decrease coverage by 0.00%.
The diff coverage is 70.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #398      +/-   ##
==========================================
- Coverage   74.94%   74.94%   -0.01%     
==========================================
  Files         146      146              
  Lines       24314    24344      +30     
==========================================
+ Hits        18223    18244      +21     
- Misses       6091     6100       +9     
Impacted Files Coverage Δ
datafusion/src/physical_plan/repartition.rs 82.45% <70.96%> (-1.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 174226c...affa192. Read the comment docs.

@andygrove
Copy link
Member Author

I added the sendTime metric and this includes the cost of round-robin.

@Dandandan
Copy link
Contributor

Looks good. I think the time calculation for round robin repartition is missing.

The new metrics don't include the time for sending the resulting batches to the channels, so the only thing to measure for round-robin would be the time to execute let output_partition = counter % num_output_partitions so I figured that was not worth measuring.

I am now wondering if we should also measure time to send the results to the channel because if this is high it could indicate that upstream operators are not fetching data as fast as they could be. I will take a look at that next.

Thanks, makes sense 👍

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition!

@Dandandan Dandandan merged commit aeed776 into apache:master May 23, 2021
@houqp houqp added api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate labels Jul 29, 2021
@andygrove andygrove deleted the repart-metrics branch February 6, 2022 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement SQLMetrics for RepartitionExec
4 participants